build(medcat and medcat-den): CU-869ddh1jv Avoid test resources in releases by mart-r · Pull Request #503 · CogStack/cogstack-nlp

mart-r · 2026-05-22T11:53:13Z

The underlying issue

medcat-den source distributions are pushed to TestPyPI on every commit. Because they include test-time resources (test / fake models), they are rather large (~32MB). Over time, this has meant we've reached PyPI's per-project storage limit of 10GB. As a result, medcat-den workflows on the main branch are now failing because the TestPyPI uploads fail.

Caveats to consider

The idea of packaging your tests (along with the resources required to run them) is quite common for source distributions. In fact, the default behaviour seems to be to include everything that is tracked by git. There are a number of ways to get around this (e.g. removing the files before building, or pruning in MANIFEST.in), but they seem to either run counter to open source principles or not really follow modern package building standards.

The proposed plan

In order to make this a viable option, I plan to store test-time models centrally in the repo. This means that they won't be included in the builds, since they're outside the scope of the package source. It also has the added benefit of allowing us to reuse the same test models across multiple projects within the repo (e.g. medcat and medcat-den, but why not medcat-service as well). On top of that, there needs to be a way to access these files from a source distribution. Because the source distribution no longer includes these test-time resources, they need to be fetched. The plan uses pooch to do the fetching from the relevant version on GitHub, but the logic defaults to local files if they are available. This will involve including these files in the relevant releases as well. Along the way we also need to change the exact paths that the test suite uses to interact with these models (but that shouldn't be extensive).

This is the plan:

Store test time models centrally
Implement for MedCAT
Implement for MedCAT-den
Implement workflow to add resources to relevant releases
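The fetch-with-local-fallback behaviour described above could look roughly like the sketch below. This is a minimal illustration, not the code from this PR: the enum class name `DefinedResource`, the function `fetch_resource`, and all of its parameters are hypothetical, and only the member name `v2_model` and its file name come from the diff. It relies on `pooch.retrieve`, which downloads a single URL into a cache directory.

```python
from enum import Enum
from pathlib import Path


class DefinedResource(Enum):
    # Only this member and its file name appear in the PR diff;
    # the enum class name itself is hypothetical.
    v2_model = "mct2_model_pack.zip"


def fetch_resource(resource, local_dir, base_url, cache_dir):
    """Return a path to the resource, preferring a local copy over a download."""
    local_path = Path(local_dir) / resource.value
    if local_path.exists():
        # Running from a git checkout: the central folder is present.
        return local_path
    # Imported lazily so the local-file path works without pooch installed.
    import pooch

    downloaded = pooch.retrieve(
        url=f"{base_url.rstrip('/')}/{resource.value}",
        known_hash=None,  # assumption: no checksum is pinned in this sketch
        fname=resource.value,
        path=cache_dir,
    )
    return Path(downloaded)
```

With this shape, a test run from a repository checkout never touches the network, while a test run from an installed source distribution downloads the file once and reuses the cached copy afterwards.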

alhendrickson

Makes sense, looks a lot better than before. Will be keen to get medcat-service in on this as well, with the complication that it also wants the models in the docker image.

Pooch looks great as well. I've got a few places (trainer, medcat service k8s) where I need to download assets on startup and might look to use this instead.

alhendrickson · 2026-05-27T10:08:59Z

+    v2_model = "mct2_model_pack.zip"
+
+
+def _get_version(project_name: str = 'medcat') -> str:


As a dumber option as it's all hardcoded anyway - could we just pin an exact stable version instead of using _get_version?

I'd probably rather always pull cogstack-nlp/releases/download/3.11/my_zip reliably, instead of being a dynamic URL to debug later

Right now it kind of looks like the medcat version will change the test model used, but I'd rather make it a deliberate change. I really don't see the test models changing any time soon.

I see what you're saying. And I agree that these files are unlikely to change frequently.

But I do think it's realistic that they may be added to at some point.

Hard-coding the version here would then mean we'd need to bump the version every time we do a release.
Realistically, we're never really running the tests in the state that this solution is designed for (i.e. an install from a source distribution that then runs the tests).
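The two options discussed in this thread (a dynamic version lookup versus a pinned one) can be contrasted in a small sketch. The helper names `resolve_version` and `release_asset_url` are hypothetical and are not the `_get_version` implementation from the PR. The tag format is also an assumption: the real release tags in this repository may include a project prefix.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(project_name="medcat", pinned=None):
    """Return the pinned version if given, else the installed package version."""
    if pinned is not None:
        # Deliberate, static choice: the URL only changes when this is edited.
        return pinned
    try:
        # Dynamic choice: follows whatever version of the project is installed.
        return version(project_name)
    except PackageNotFoundError as err:
        raise RuntimeError(f"{project_name} is not installed") from err


def release_asset_url(repo, tag, filename):
    """Build a GitHub release download URL (assumes the tag equals the version)."""
    return f"https://github.com/{repo}/releases/download/{tag}/{filename}"
```

For example, `release_asset_url("CogStack/cogstack-nlp", resolve_version(pinned="3.11"), "mct2_model_pack.zip")` always yields the same URL, whereas omitting `pinned` makes the URL follow the installed version, which is the trade-off the two comments above weigh.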

alhendrickson · 2026-05-27T10:31:58Z

        with:
          packages_dir: medcat-den/dist
+
+  # test-time models for download


Along with the pin version comment -

Potentially we could do a release/tag for "medcat-test-models" separately, and test models from the medcat/medcat-den versions/releases? (Basically what you did with the release bundle idea).

This is with the assumption that the test models basically never change. The main advantage is that it would probably be easier to reason about "medcat 3 is failing test X, but I know the test model zip is still version 1 and that exact file hasn't changed for a year". I do see the advantage of versioning it all together as well, and I'm sure you've already thought about this one! It's definitely hard if the zip in source code changes but the released one doesn't...

If we end up using this in more places, it may make sense to centralise the entire logic into an installable package.

I think the main thing I want to avoid is yet another release cadence. As much as it's unlikely we'll need to bump this often, having these not tied to the releases of the specific tools that use them would (at least in my eyes) make it harder to track what version of which tool should work with what version of these test time models.

Cool - yeah that's fair for sure that it gets harder to track. I'm all good with this as you have it!
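The drift concern raised in this thread (the zip in source changing while the released one does not) could be checked by comparing content hashes. The sketch below is only an illustration under that assumption; `sha256_of` and `find_drift` are hypothetical helpers and are not part of this PR.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large model packs are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_drift(source_dir, released_dir, filenames):
    """Return the file names that are missing on either side or differ in content."""
    drifted = []
    for name in filenames:
        source = Path(source_dir) / name
        released = Path(released_dir) / name
        if not (source.is_file() and released.is_file()):
            drifted.append(name)
        elif sha256_of(source) != sha256_of(released):
            drifted.append(name)
    return drifted
```

The same digest could also be passed to pooch as a `known_hash` value of the form `"sha256:<hex digest>"`, so that a download which no longer matches the expected file fails loudly instead of silently changing test behaviour.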

* fix(medcat): CU-869ddh1jv: Fix lock file issue. Issue was introduced in #503. Looks like because I didn't update uv.lock the entire dependency resolution was redone and broke. This PR should - hopefully - fix it.
* CU-869ddh1jv: Disallow latest typer version
* CU-869ddh1jv: Disallow all recent typer versions

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

github-actions Bot added 9 commits May 22, 2026 11:57

CU-869ddh1jv: Move test models to a separate central folder

807f8f6

CU-869ddh1jv: Add pooch dependency in preparation for centralised fetching

f9d11f0

CU-869ddh1jv: Add resource fetcher

64f4fd7

CU-869ddh1jv: Add some defensiveness to resource fetcher

befa9f8

CU-869ddh1jv: Create enum for defined resources

23ba05b

CU-869ddh1jv: Allow using defined resource name

a6b960c

CU-869ddh1jv: Change logic when using defined resource name

7578151

CU-869ddh1jv: Use centralised test paths

e0241b9

CU-869ddh1jv: Use centralised path for models in conversion tests

204916c

mart-r marked this pull request as draft May 22, 2026 11:53

github-actions Bot and others added 16 commits May 22, 2026 14:53

CU-869ddh1jv: Add comment regarding duplicate files

158ae94

CU-869ddh1jv: Remove local test-time model packs

679514c

CU-869ddh1jv: Add duplicate resource fetch to medcat-den

370afae

CU-869ddh1jv: Propagate project name in resource fetcher

1efc53a

CU-869ddh1jv: Add small workflow to check test-time resource fetcher sync status

fe6dd1f

CU-869ddh1jv: Use resource fetch at test time for medcat-den

c282dca

CU-869ddh1jv: Add workflow to add test models to releases

54d47ad

CU-869ddh1jv: Add workflow jon to add test models to medcat release

f3f331f

CU-869ddh1jv: Add workflow job to add test models to medcat-den release

95fbd70

CU-869ddh1jv: Fix issue with test utils sync check in workflow

4a623e9

CU-869ddh1jv: Fix test upload step in releas workflow

5391667

CU-869ddh1jv: Fix test upload step comment in medcat-den release workflow

9eb5594

CU-869ddh1jv: Fix test upload step in medcat release workflow

3dde906

CU-869ddh1jv: Fix workflow (add runs-on) for medcat-den

5f21cab

CU-869ddh1jv: Fix workflow for medcat to upload test models (add runs-on)

18641a5

CU-869ddh1jv: Add missing dev-time dependency of pooch to medcat-den

d0d2058

mart-r marked this pull request as ready for review May 22, 2026 15:18

alhendrickson approved these changes May 27, 2026

alhendrickson reviewed May 27, 2026

mart-r merged commit 8f97c76 into main May 27, 2026
27 checks passed

mart-r deleted the build/medcat-and-den/CU-869ddh1jv-avoid-test-resources-in-releases branch May 27, 2026 11:42

mart-r mentioned this pull request May 27, 2026

fix(medcat): CU-869ddh1jv: Fix lock file issue. #505

Merged

mart-r mentioned this pull request May 27, 2026

build(medcat-den and embedding-linker): CU-869ddh1jv Fix downstream installs #507

Open

mart-r merged 25 commits into main from build/medcat-and-den/CU-869ddh1jv-avoid-test-resources-in-releases
